Abstract —Attributes act as intermediate representations that enable parameter sharing between classes, a must when training 
data is scarce. We propose to view attribute-based image classification as a label-embedding problem: each class is embedded in 
the space of attribute vectors. We introduce a function that measures the compatibility between an image and a label embedding. 
The parameters of this function are learned on a training set of labeled samples to ensure that, given an image, the correct 
classes rank higher than the incorrect ones. Results on the Animals With Attributes and Caltech-UCSD-Birds datasets show that 
the proposed framework outperforms the standard Direct Attribute Prediction baseline in a zero-shot learning scenario. Label 
embedding enjoys a built-in ability to leverage alternative sources of information instead of or in addition to attributes, such as 
class hierarchies or textual descriptions. Moreover, label embedding encompasses the whole range of learning settings from 
zero-shot learning to regular learning with a large number of labeled examples. 

Index Terms —Image Classification, Label Embedding, Zero-Shot Learning, Attributes. 
1 Introduction 

We consider the image classification problem where the 
task is to annotate a given image with one (or multiple) 
class label(s) describing its visual content. Image classification
is a prediction task: the goal is to learn from a labeled
training set a function $f : \mathcal{X} \to \mathcal{Y}$ which maps an input $x$ in
the space of images $\mathcal{X}$ to an output $y$ in the space of class
labels $\mathcal{Y}$. In this work, we are especially interested in the
case where classes are related (e.g. they all correspond to
animals), but where we do not have any (positive) labeled 
sample for some of the classes. This problem is generally 
referred to as zero-shot learning [18], [30], [31], [43]. Given 
the impossibility to collect labeled training samples in an 
exhaustive manner for all possible visual concepts, zero- 
shot learning is a problem of high practical value. 

Fig. 1. Much work in computer vision has been devoted
to image embedding (left): how to extract suitable
features from an image. We focus on label embedding
(right): how to embed class labels in a Euclidean
space. We use side information such as attributes for
the label embedding and measure the “compatibility”
between the embedded inputs and outputs with a
function F.

An elegant solution to zero-shot learning, called attribute-based
learning, has recently gained popularity in computer
vision. Attribute-based learning consists in introducing an
intermediate space $\mathcal{A}$ referred to as attribute layer [18],
[30]. Attributes correspond to high-level properties of the
objects which are shared across multiple classes, which
can be detected by machines and which can be understood
by humans. Each class can be represented as a vector
of class-attribute associations according to the presence
or absence of each attribute for that class. Such class-attribute
associations are often binary. As an example, if the
classes correspond to animals, possible attributes include
“has paws”, “has stripes” or “is black”. For the class
“zebra”, the “has paws” entry of the attribute vector is zero
whereas the “has stripes” entry would be one. The most popular
attribute-based prediction algorithm requires learning one
classifier per attribute. To classify a new image, its attributes
are predicted using the learned classifiers and the attribute
scores are combined into class-level scores. This two-step
strategy is referred to as Direct Attribute Prediction (DAP)
in [30].
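
The two-step strategy can be illustrated with a minimal sketch. It assumes that the attribute classifiers output probabilities, that attributes are treated as independent, and that every attribute has the same prior; the function name `dap_class_scores`, the prior value 0.5, the second class and all numbers are hypothetical and are not taken from [30].

```python
import numpy as np

def dap_class_scores(attr_prob, signatures, attr_prior=0.5):
    """Combine per-attribute probabilities into class-level log-scores.

    attr_prob: (E,) predicted p(a_m = 1 | x) from E independent classifiers.
    signatures: (C, E) binary class-attribute matrix.
    attr_prior: assumed prior p(a_m = 1); 0.5 is a simplifying assumption.
    """
    attr_prob = np.clip(np.asarray(attr_prob, dtype=float), 1e-6, 1 - 1e-6)
    signatures = np.asarray(signatures)
    # Probability that each attribute takes the value the class expects.
    p_match = np.where(signatures == 1, attr_prob, 1.0 - attr_prob)
    p_prior = np.where(signatures == 1, attr_prior, 1.0 - attr_prior)
    # Independence assumption: the class score is a product over attributes,
    # computed here as a sum of logs.
    return np.sum(np.log(p_match) - np.log(p_prior), axis=1)

# Hypothetical attributes: [has paws, has stripes, is black].
signatures = np.array([[0, 1, 1],   # "zebra" (the "is black" entry is made up)
                       [1, 0, 1]])  # a second, made-up class
attr_prob = np.array([0.1, 0.9, 0.6])  # made-up attribute classifier outputs
scores = dap_class_scores(attr_prob, signatures)
predicted = int(np.argmax(scores))
```

The attribute classifiers in this sketch are never told which classes they will be used to separate, which is the first shortcoming discussed next.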

DAP suffers from several shortcomings. First, DAP 
proceeds in a two-step fashion, learning attribute-specific 
classifiers in a first step and combining attribute scores into 
class-level scores in a second step. Since attribute classifiers
are learned independently of the end-task, the overall
strategy of DAP might be optimal at predicting attributes
but not necessarily at predicting classes. Second, we would 
like an approach that can perform zero-shot prediction if 
no labeled samples are available for some classes, but that 
can also leverage new labeled samples for these classes 
as they become available. While DAP is straightforward 
to implement for zero-shot learning problems, it is not 
straightforward to extend to such an incremental learning 
scenario. Third, while attributes can be a useful source 
of prior information, they are expensive to obtain and 
the human labeling is not always reliable. Therefore, it is 
advantageous to seek complementary or alternative sources 
of side information such as class hierarchies or textual 
descriptions (see section 4). It is not straightforward to 
design an efficient way to incorporate these additional 
sources of information into DAP. Various solutions have 
been proposed to address each of these problems separately 
(see section 2). However, we do not know of any existing 
solution that addresses all of them in a principled manner. 

Our primary contribution is therefore to propose such a 
solution by making use of the label embedding framework. 
We underline that, while there is an abundant literature in 
the computer vision community on image embedding (how 
to describe an image) much less work has been devoted 
in comparison to label embedding in the $\mathcal{Y}$ space (how
to describe a class). We embed each class $y \in \mathcal{Y}$ in the
space of attribute vectors and thus refer to our approach 
as Attribute Label Embedding (ALE). We use a structured 
output learning formalism and introduce a function which 
measures the compatibility between an image x and a label 
y (see Figure 1). The parameters of this function are learned 
on a training set of labeled samples to ensure that, given an 
image, the correct class(es) rank higher than the incorrect 
ones. Given a test image, recognition consists in searching 
for the class with the highest compatibility. 

Another important contribution of this work is to show 
that our approach extends far beyond the setting of 
attribute-based recognition: it can be readily used for any 
side information that can be encoded as vectors in order to 
be leveraged by the label embedding framework. 

Label embedding addresses in a principled fashion the 
three limitations of DAP that were mentioned previously. 
First, we optimize directly a class ranking objective, 
whereas DAP proceeds in two steps by solving intermediate 
problems. We show experimentally that ALE outperforms 
DAP in the zero-shot setting. Second, if available, labeled 
samples can be used to learn the embedding. Third, other 
sources of side information can be combined with attributes 
or used as an alternative source in place of attributes. 

The paper is organized as follows. In Sec. 2-3, we review 
related work and introduce ALE. In Sec. 4, we study 
extensions of label embedding beyond attributes. In Sec. 5, 
we present experimental results on Animals with Attributes 
(AWA) [30] and Caltech-UCSD-Birds (CUB) [63]. In particular,
we compare ALE with competing alternatives, using
the same side information, i.e. attribute-class association
matrices.

A preliminary version of this article appeared in [1]. 
This version adds (1) an expanded related work section;
(2) a detailed description of the learning procedure for
ALE; (3) additional comparisons with random embeddings
[14] and embeddings derived automatically from
textual corpora [40], [20]; (4) additional zero-shot learning
experiments, which show the advantage of using continuous
embeddings; and (5) additional few-shots learning
experiments.

2 Related work 

We now review related work on attributes, zero-shot 
learning and label embedding, three research areas which 
strongly overlap. 

2.1 Attributes 

Attributes have been used for image description [19], 
[18], [9], caption generation [27], [41], face recognition 
[29], [51], [10], image retrieval [28], [56], [15], action 
recognition [32], [69], novelty detection [62] and object 
classification [30], [18], [64], [65], [34], [54], [38]. Since 
our task is object classification in images, we focus on the 
corresponding references. 

The most popular approach to attribute-based recognition
is the Direct Attribute Prediction (DAP) model of Lampert
et al. which consists in predicting the presence of attributes
in an image and combining the attribute prediction probabilities
into class prediction probabilities [30]. A significant
limitation of DAP is the fact that it assumes that attributes 
are independent from each other, an assumption which 
is generally incorrect (see our experiments on attribute 
correlation in section 5.3). Consequently, DAP has been 
improved to take into account the correlation between 
attributes or between attributes and classes [64], [65], 
[71], [34]. However, all these models have limitations of 
their own. Wang and Forsyth [64] assume that images 
are labeled with both classes and attributes. In our work 
we only assume that classes are labeled with attributes, 
which requires significantly less hand-labeling of the data. 
Mahajan et al. [34] use transductive learning and, therefore, 
assume that the test data is available as a batch, a strong 
assumption we do not make. Yu and Aloimonos’s topic 
model [71] is only applicable to bag-of-visual-word image 
representations and, therefore, cannot leverage recent state- 
of-the-art image features such as the Fisher vector [50]. We 
will use such features in our experiments. Finally, the latent 
SVM framework of Wang and Mori [65] is not applicable 
to zero-shot learning, the focus of this work. 

Several works have also considered the problem of discovering
a vocabulary of attributes [5], [16], [36]. [5] leverages
text and images sampled from the Internet and uses
the mutual information principle to measure the information 
of a group of attributes. [16] discovers local attributes 
and integrates humans in the loop for recommending the 
selection of attributes that are semantically meaningful. [36] 
discovers attributes from images, textual comments and 
ratings for the purpose of aesthetic image description. In our 
work, we assume that the class-attribute association matrix 
is provided. In this sense, our work is complementary to 
those previously mentioned. 
2.2 Zero-shot learning 

Zero-shot learning requires the ability to transfer knowledge 
from classes for which we have training data to classes 
for which we do not. There are two crucial choices when 
performing zero-shot learning: the choice of the prior 
information and the choice of the recognition model. 

Possible sources of prior information include attributes 
[30], [18], [43], [47], [46], semantic class taxonomies [46], 
[39], class-to-class similarities [47], [70], text features [43], 
[47], [46], [57], [20] or class co-occurrence statistics [37]. 
Rohrbach et al. [46] compare different sources of information
for learning with zero or few samples. However,
since different models are used for the different sources 
of prior information, it is unclear whether the observed 
differences are due to the prior information itself or the 
model. In our work, we compare attributes, class hierarchies 
and textual information obtained from the internet using 
the exact same learning framework and we can, therefore, 
fairly compare different sources of prior information. Other 
sources of prior information have been proposed for special 
purpose problems. For instance, Larochelle et al. [31] 
encode characters with 7x5 pixel representations. However, 
it is difficult to extend such an embedding to the case of 
generic visual categories - our focus in this work. For a 
recent survey of different output embeddings optimized for 
zero-shot learning on fine-grained datasets, the reader may 
refer to [2]. 

As for the recognition model, there are several alternatives.
As mentioned earlier, DAP uses a probabilistic
model which assumes attribute independence [30]. Closest
to the proposed ALE are those works where zero-shot
recognition is performed by assigning an image to its
closest class embedding (see next section). The distance
between an image and a class embedding
is generally measured as the Euclidean distance and a
transformation is learned to map the input image features
to the class embeddings [43], [57]. The main difference
between these works and ours is that we learn the
input-to-output mapping to optimize directly an image
classification criterion: we learn to rank the correct label
higher than incorrect ones. We will see in section 5.3 that 
this leads to improved results compared to those works 
which optimize a regression criterion such as [43], [57]. 

Few works have considered the problem of transitioning 
from zero-shot learning to learning with few shots [71], 
[54], [70]. As mentioned earlier, [71] is only applicable to 
bag-of-words type of models. [54] proposes to augment the 
attribute-based representation with additional dimensions 
for which an autoencoder model is coupled with a large 
margin principle. While this extends DAP to learning with 
labeled data, this approach does not improve DAP for zero- 
shot recognition. In contrast, we show that the proposed 
ALE can transition from zero-shot to few-shots learning 
and improves on DAP in the zero-shot regime. [70] learns
separately the class embeddings and the input-to-output
mapping, which is suboptimal. In this paper, we learn jointly
the class embeddings (using attributes as prior) and the
input-to-output mapping to optimize classification accuracy.

2.3 Label embedding 

In computer vision, a vast amount of work has been devoted 
to input embedding, i.e. how to represent an image. This 
includes work on patch encoding (see [8] for a recent 
comparison), on kernel-based methods [55] with a recent 
focus on explicit embeddings [35], [60], on dimensionality
reduction [55] and on compression [26], [49], [61].
Comparatively, much less work has been devoted to label 
embedding. 

Provided that the embedding function $\varphi$ is chosen correctly
(i.e. “similar” classes are close according to the
Euclidean metric in the embedded space), label embedding
can be an effective way to share parameters between
classes. Consequently, the main applications have been
multiclass classification with many classes [3], [66], [67],
[4] and zero-shot learning [31], [43]. We now provide a
taxonomy of embeddings. While this taxonomy is valid
for both input embeddings $\theta$ and output embeddings $\varphi$, we focus here
on output embeddings. They can be (i) fixed and
data-independent, (ii) learned from data, or (iii) computed from
side information.

Data-Independent Embeddings. Kernel dependency estimation
[68] is an example of a strategy where $\varphi$ is
data-independent and defined implicitly through a kernel
in the $\mathcal{Y}$ space. The compressed sensing approach of
Hsu et al. [25] is another example of data-independent
embeddings where $\varphi$ corresponds to random projections.
The Error Correcting Output Codes (ECOC) framework encompasses
a large family of embeddings that are built using
information-theoretic arguments [22]. ECOC approaches
make it possible in particular to tackle multi-class learning problems
as described by Dietterich and Bakiri in [14]. The reader
can refer to [17] for a summary of ECOC methods and 
latest developments in the ternary output coding methods. 
Other data-independent embeddings are based on pairwise 
coupling and variants thereof such as generalized Bradley- 
Terry models [23]. 

Learned Embeddings. A strategy consists in learning
jointly $\theta$ and $\varphi$ to embed the inputs and outputs in a
common intermediate space $\mathcal{Z}$. The most popular example
is Canonical Correlation Analysis (CCA) [23], which
maximizes the correlation between inputs and outputs. 
Other strategies have been investigated which maximize 
directly classification accuracy, including the nuclear norm 
regularized learning of Amit et al. [3] or the WSABIE 
algorithm of Weston et al. [67]. 

Embeddings Derived From Side Information. There 
are situations where side information is available. This 
setting is particularly relevant when little training data is 
available, as side information and the derived embeddings 
can compensate for the lack of data. Side information 
can be obtained at an image level [18] or at a class 
level [30]. We focus on the latter setting which is more 
practical as collecting side information at an image level is 
more costly. Side information may include “hand-drawn”
descriptions [31], text descriptions [18], [30], [43], [20]
or class taxonomies [66], [4]. Certainly, the closest work
to ours is that of Frome et al. [20]¹, which involves
embedding classes using textual corpora and then learning a 
mapping between the input and output embeddings using a 
ranking objective function. We also use a ranking objective 
function and compare different sources of side information 
to perform embedding: attributes, class taxonomies and 
textual corpora. 

While our focus is on embeddings derived from side 
information for zero-shot recognition, we also considered 
data-independent embeddings and learned embeddings (using
side information as a prior) for few-shots recognition.

3 Label embedding with attributes 

Given a training set $S = \{(x_n, y_n), n = 1 \ldots N\}$ of
input/output pairs with $x_n \in \mathcal{X}$ and $y_n \in \mathcal{Y}$, our goal is to
learn a function $f : \mathcal{X} \to \mathcal{Y}$ by minimizing an empirical
risk of the form

$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^{N} \Delta(y_n, f(x_n))$ (1)

where $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ measures the loss incurred from
predicting $f(x)$ when the true label is $y$, and where the
function $f$ belongs to the function class $\mathcal{F}$. We shall use the 0/1
loss as a target loss: $\Delta(y, z) = 0$ if $y = z$, 1 otherwise, to
measure the test error, while we consider several surrogate 
losses commonly used for structured prediction at learning 
time (see Sec. 3.3 for details on the surrogate losses used 
in this paper). 

An elegant framework, initially proposed in [68], allows
one to concisely describe learning problems where both input
and output spaces are jointly or independently mapped
into lower-dimensional spaces. The framework relies on so-called
embedding functions $\theta : \mathcal{X} \to \tilde{\mathcal{X}}$ and $\varphi : \mathcal{Y} \to \tilde{\mathcal{Y}}$,
respectively for the inputs and outputs. Thanks to these embedding
functions, the learning problem is cast into a regular
learning problem with transformed input/output pairs.

In what follows, we first describe our function class
$\mathcal{F}$ (section 3.1). We then explain how to leverage side
information in the form of attributes to compute label
embeddings (section 3.2). We also discuss how to learn
the model parameters (section 3.3). While, for the sake
of simplicity, we focus on attributes in this section, the
approach readily generalizes to any side information that
can be encoded in matrix form (see section 4).

3.1 Framework 

Figure 1 illustrates the proposed model. Inspired by
the structured prediction formulation [58], we introduce a
compatibility function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and define $f$ as
follows:

$f(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w)$ (2)

1. Note that the work of Frome et al. [20] is posterior to our conference 
submission [1]. 


where $w$ denotes the model parameter vector of $F$ and
$F(x, y; w)$ measures how compatible the pair $(x, y)$ is
given $w$. It is generally assumed that $F$ is linear in some
combined feature embedding of inputs/outputs $\psi(x, y)$:

$F(x, y; w) = w' \psi(x, y)$ (3)

and that the joint embedding can be written as the tensor
product between the image embedding $\theta : \mathcal{X} \to \tilde{\mathcal{X}} = \mathbb{R}^D$
and the label embedding $\varphi : \mathcal{Y} \to \tilde{\mathcal{Y}} = \mathbb{R}^E$:

$\psi(x, y) = \theta(x) \otimes \varphi(y)$ (4)

and $\psi(x, y) : \mathbb{R}^D \times \mathbb{R}^E \to \mathbb{R}^{DE}$. In this case $w$ is a
$DE$-dimensional vector which can be reshaped into a $D \times E$
matrix $W$. Consequently, we can rewrite $F(x, y; w)$ as a
bilinear form:

$F(x, y; W) = \theta(x)' W \varphi(y)$. (5)

Other compatibility functions could have been considered.
For example, the function:

$F(x, y; W) = -\|\theta(x)' W - \varphi(y)\|^2$ (6)

is typically used in regression problems.

Also, if $D$ and $E$ are large, it might be valuable to
consider a low-rank decomposition $W = U'V$ to reduce the
effective number of parameters. In such a case, we have:

$F(x, y; U, V) = (U\theta(x))' (V\varphi(y))$. (7)

CCA [23], or more recently WSABIE [67] rely, for example,
on such a decomposition.
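
As a rough illustration of the bilinear form (5), the prediction rule (2) and the low-rank variant (7), the following NumPy sketch scores all classes at once. The function names, the toy dimensions and the random inputs are assumptions made for this example only; `Phi` denotes a matrix holding one label embedding per column.

```python
import numpy as np

def compatibility(theta_x, W, Phi):
    """Bilinear scores F(x, y; W) = theta(x)' W phi(y) for every class y.

    theta_x: (D,) image embedding; W: (D, E); Phi: (E, C), one column per class.
    """
    return theta_x @ W @ Phi

def predict(theta_x, W, Phi):
    """Prediction rule (2): index of the most compatible class."""
    return int(np.argmax(compatibility(theta_x, W, Phi)))

def low_rank_compatibility(theta_x, U, V, Phi):
    """Low-rank form (7) with W = U'V: (U theta(x))' (V phi(y)) for every y."""
    return (U @ theta_x) @ (V @ Phi)

# Toy sizes (hypothetical): D=4 image dims, E=3 attributes, C=5 classes, rank R=2.
rng = np.random.default_rng(0)
D, E, C, R = 4, 3, 5, 2
theta_x = rng.normal(size=D)
Phi = rng.normal(size=(E, C))
U = rng.normal(size=(R, D))
V = rng.normal(size=(R, E))
W = U.T @ V  # the full matrix implied by the low-rank factors

scores_full = compatibility(theta_x, W, Phi)
scores_low_rank = low_rank_compatibility(theta_x, U, V, Phi)
predicted_class = predict(theta_x, W, Phi)
```

With $W = U'V$ the two scoring functions agree up to rounding, while the factors hold $R(D + E)$ parameters instead of $DE$.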

3.2 Embedding classes with attributes 

We now consider the problem of defining the label embedding
function $\varphi^{\mathcal{A}}$ from attribute side information. In
this case, we refer to our approach as Attribute Label
Embedding (ALE).

We assume that we have $C$ classes, i.e. $\mathcal{Y} = \{1, \ldots, C\}$
and that we have a set of $E$ attributes $\mathcal{A} = \{a_i, i = 1 \ldots E\}$
to describe the classes. We also assume that we are provided
with an association measure $\rho_{y,i}$ between each attribute $a_i$
and each class $y$. These associations may be binary or real-valued
if we have information about the association strength
(e.g. if the association value is obtained by averaging votes).
We embed class $y$ in the $E$-dim attribute space as follows:

$\varphi^{\mathcal{A}}(y) = [\rho_{y,1}, \ldots, \rho_{y,E}]$ (8)

and denote $\Phi^{\mathcal{A}}$ the $E \times C$ matrix of attribute embeddings
which stacks the individual $\varphi^{\mathcal{A}}(y)$'s.

We note that in equation (5) the image and label embeddings
play symmetric roles. In the same way it makes sense
to normalize samples when they are used as input to large-margin
classifiers, it can make sense to normalize the output
vectors $\varphi^{\mathcal{A}}(y)$. In section 5.3 we compare (i) continuous
embeddings, (ii) binary embeddings using $\{0, 1\}$ for the
encoding and (iii) binary embeddings using $\{-1, 1\}$ for
the encoding. We also explore two normalization strategies:
(i) mean-centering (i.e. compute the mean over all learning
classes and subtract it) and (ii) $\ell_2$-normalization. We underline
that such encoding and normalization choices are not
arbitrary but relate to prior assumptions we might have on
the problem. For instance, underlying the $\{0, 1\}$ embedding
is the assumption that the presence of the same attribute in
two classes should contribute to their similarity, but not its
absence. Here we assume a dot-product similarity between
attribute embeddings which is consistent with our linear
compatibility function (5). Underlying the $\{-1, 1\}$ embedding
is the assumption that the presence or the absence of
the same attribute in two classes should contribute equally
to their similarity. As for mean-centered attributes, they take
into account the fact that some attributes are more frequent
than others. For instance, if an attribute appears in almost all
classes, then in the mean-centered embedding, its absence
will contribute more to the similarity than its presence. This
is similar to an IDF effect in TF-IDF encoding. As for the
$\ell_2$-normalization, it enforces that each class is closest to
itself according to the dot-product similarity.

In the case where attributes are redundant, it might be 
advantageous to de-correlate them. In such a case, we make 
use of the compatibility function (7). The matrix V may 
be learned from labeled data jointly with U. As a simpler 
alternative, it is possible to first learn the decorrelation, e.g. 
by performing a Singular Value Decomposition (SVD) on
the $\Phi^{\mathcal{A}}$ matrix, and then to learn $U$. We will study the effect
of attribute de-correlation in our experiments.
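
The encoding and normalization choices discussed in this section can be sketched as follows. This is only an illustration: the helper names, the binarization threshold of 0.5 and the association values are assumptions, not values used in the experiments.

```python
import numpy as np

def encode(rho, mode="continuous", threshold=0.5):
    """Encode a (C, E) class-attribute association matrix rho.

    Returns a matrix of shape (E, C) with one column per class.
    The binarization threshold is a hypothetical choice for this sketch.
    """
    rho = np.asarray(rho, dtype=float)
    if mode == "continuous":
        enc = rho.copy()
    elif mode == "binary01":
        enc = (rho > threshold).astype(float)
    elif mode == "binary_pm1":
        enc = np.where(rho > threshold, 1.0, -1.0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return enc.T

def mean_center(Phi):
    """Subtract the mean embedding computed over the given (training) classes."""
    return Phi - Phi.mean(axis=1, keepdims=True)

def l2_normalize(Phi, eps=1e-12):
    """Scale each class embedding (column) to unit Euclidean norm."""
    norms = np.linalg.norm(Phi, axis=0, keepdims=True)
    return Phi / np.maximum(norms, eps)

# Hypothetical association strengths for 3 classes and 4 attributes.
rho = np.array([[0.9, 0.1, 0.8, 0.0],
                [0.2, 0.7, 0.9, 0.1],
                [0.0, 0.6, 0.3, 0.95]])

Phi_01 = encode(rho, "binary01")
Phi_pm1 = encode(rho, "binary_pm1")
Phi_unit = l2_normalize(encode(rho, "continuous"))
gram = Phi_unit.T @ Phi_unit  # dot-product similarities between classes
```

After normalization every diagonal entry of `gram` equals one and dominates its row, which is the "each class is closest to itself" property mentioned above.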

3.3 Learning algorithm 

We now turn to the estimation of the model parameters W 
from a labeled training set S. The simplest learning strategy 
is to maximize directly the compatibility between the input 
and output embeddings: 

$\frac{1}{N} \sum_{n=1}^{N} F(x_n, y_n; W)$ (9)

with potentially some constraints and regularizations on 
W. This is exactly the strategy adopted in regression [43], 
[57]. However, such an objective function does not optimize
directly our end-goal which is image classification. Therefore,
we draw inspiration from the WSABIE algorithm [67]
that learns jointly image and label embeddings from data
to optimize classification accuracy. The crucial difference
between WSABIE and ALE is the fact that the latter uses
attributes as side information. Note that the proposed
ALE is not tied to WSABIE and that we report results
in section 5.3 with other objective functions including regression
and structured SVM (SSVM). We chose to focus on the
WSABIE objective function with ALE because it yields 
good results and is scalable. 

In what follows, we briefly review the WSABIE objective
function [67]. Then, we present ALE, which allows for (i)
zero-shot learning with side information and (ii) learning
with few (or more) examples with side information. We
then detail the proposed learning procedures for ALE. In
what follows, $\Phi$ is the matrix which stacks the embeddings
$\varphi(y)$.


WSABIE. Let $\mathbb{1}(u) = 1$ if $u$ is true and 0 otherwise. Let:

$\ell(x_n, y_n, y) = \Delta(y_n, y) + \theta(x_n)' W [\varphi(y) - \varphi(y_n)]$ (10)

Let $r(x_n, y_n)$ be the rank of label $y_n$ for image $x_n$.
Finally, let $\alpha_1, \alpha_2, \ldots, \alpha_C$ be a sequence of $C$ non-negative
coefficients and let $\beta_k = \sum_{j=1}^{k} \alpha_j$. Usunier et al. [59]
propose to use the following ranking loss for $S$:

$\frac{1}{N} \sum_{n=1}^{N} \beta_{r(x_n, y_n)}$ (11)

where $\beta_{r(x_n, y_n)} := \sum_{j=1}^{r(x_n, y_n)} \alpha_j$. Since the $\beta_k$'s are
increasing with $k$, minimizing $\beta_{r(x_n, y_n)}$ enforces to minimize
the $r(x_n, y_n)$'s, i.e. it enforces correct labels to rank higher
than incorrect ones. $\alpha_k$ quantifies the penalty incurred by
going from rank $k$ to $k+1$. Hence, a decreasing sequence
$\alpha_1 \geq \alpha_2 \geq \ldots \geq \alpha_C \geq 0$ implies that a mistake on the
rank when the true rank is at the top of the list incurs a
higher loss than a mistake on the rank when the true rank is
lower in the list, which is a desirable property. Following Usunier
et al., we choose $\alpha_k = 1/k$.

Instead of optimizing an upper-bound on (11), Weston
et al. propose to optimize the following approximation of
objective (11):

$R(S; W, \Phi) = \frac{1}{N} \sum_{n=1}^{N} \frac{\beta_{r_\Delta(x_n, y_n)}}{r_\Delta(x_n, y_n)} \sum_{y \in \mathcal{Y}} \max\{0, \ell(x_n, y_n, y)\}$ (12)

where

$r_\Delta(x_n, y_n) = \sum_{y \in \mathcal{Y}} \mathbb{1}(\ell(x_n, y_n, y) > 0)$ (13)

is an upper-bound on the rank of label $y_n$ for image $x_n$.
The main advantage of the formulation (12) is that it
can be optimized efficiently through Stochastic Gradient 
Descent (SGD), as described in Algorithm 1. The label 
embedding space dimensionality is a parameter to set, 
for instance using cross-validation. Note that the previous 
objective function does not incorporate any regularization 
term. Regularization is achieved implicitly by early stopping,
i.e. the learning is terminated once the accuracy stops
increasing on the validation set. 
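
To make the weighted ranking objective concrete, the per-sample term of (12) can be computed as in the sketch below. It assumes the 0/1 loss for $\Delta$, the choice $\alpha_k = 1/k$, and a matrix `Phi` with one label embedding per column; the function name and the toy numbers are illustrative assumptions.

```python
import numpy as np

def wsabie_sample_loss(theta_x, y_true, W, Phi):
    """Per-sample term of objective (12) for one image.

    theta_x: (D,) image embedding; W: (D, E); Phi: (E, C); y_true: class index.
    Returns the loss and the number of margin-violating classes.
    """
    C = Phi.shape[1]
    scores = theta_x @ W @ Phi                      # F(x, y; W) for all y
    delta = np.ones(C)
    delta[y_true] = 0.0                             # 0/1 loss Delta(y_true, y)
    ell = delta + scores - scores[y_true]           # equation (10)
    hinge = np.maximum(0.0, ell)
    r_delta = int(np.sum(ell > 0))                  # count of violating classes
    if r_delta == 0:
        return 0.0, 0                               # correct label ranked first with margin
    beta = np.sum(1.0 / np.arange(1, r_delta + 1))  # beta_k with alpha_j = 1/j
    return beta / r_delta * hinge.sum(), r_delta

# Toy example with hand-picked scores [1.0, 0.5, -1.0] for three classes.
theta_x = np.array([1.0, 0.0])
W = np.eye(2)
Phi = np.array([[1.0, 0.5, -1.0],
                [0.0, 0.0, 0.0]])
loss_top, rank_top = wsabie_sample_loss(theta_x, 0, W, Phi)
loss_bottom, rank_bottom = wsabie_sample_loss(theta_x, 2, W, Phi)
```

In this toy case the loss is 0.5 with one violating class when the true label scores highest, and 4.125 with two violating classes when it scores lowest.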

ALE: Zero-Shot Learning. We now describe the ALE
objective for zero-shot learning. In such a case, we cannot
learn $\Phi$ from labeled data, but rely on side information. This
is in contrast to WSABIE. Therefore, the matrix $\Phi$ is fixed
and set to $\Phi^{\mathcal{A}}$ (see section 3.2 for details on $\Phi^{\mathcal{A}}$). We only
optimize the objective (12) with respect to $W$. We note that,
when $\Phi$ is fixed and only $W$ is learned, the objective (12)
is closely related to the (unregularized) structured SVM
(SSVM) objective [58]:

$\frac{1}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \ell(x_n, y_n, y)$ (14)

The main difference is the loss function, which is the
multi-class loss function for SSVM. The multi-class loss
function focuses on the score with the highest rank, while
ALE considers all scores in a weighted fashion. Similar to
WSABIE, a major advantage of ALE is its scalability to
large datasets [67], [44].

Algorithm 1 ALE stochastic training

Initialize $W^{(0)}$ randomly.
for $t = 1$ to $T$ do
    Draw $(x, y)$ from $S$.
    for $k = 1, 2, \ldots, C - 1$ do
        Draw $\bar{y} \neq y$ from $\mathcal{Y}$
        if $\ell(x, y, \bar{y}) > 0$ then
            // Update $W$
            $W^{(t)} = W^{(t-1)} + \eta_t \beta_{\lfloor (C-1)/k \rfloor} \theta(x) [\varphi(y) - \varphi(\bar{y})]'$ (16)
            // Update $\Phi$ (not applicable to zero-shot)
            $\varphi^{(t)}(y) = (1 - \eta_t \mu) \varphi^{(t-1)}(y) + \eta_t \mu \varphi^{\mathcal{A}}(y) + \eta_t \beta_{\lfloor (C-1)/k \rfloor} W' \theta(x)$ (17)
            $\varphi^{(t)}(\bar{y}) = (1 - \eta_t \mu) \varphi^{(t-1)}(\bar{y}) + \eta_t \mu \varphi^{\mathcal{A}}(\bar{y}) - \eta_t \beta_{\lfloor (C-1)/k \rfloor} W' \theta(x)$ (18)
        end if
    end for
end for

ALE: Few-Shots Learning. We now describe the ALE
objective for the case where we have labeled data and side
information. In such a case, we want to learn the class
embeddings using $\Phi^{\mathcal{A}}$ as prior information. We, therefore,
add to the objective (12) a regularizer:

$R(S; W, \Phi) + \frac{\mu}{2} \|\Phi - \Phi^{\mathcal{A}}\|^2$ (15)

and optimize jointly with respect to $W$ and $\Phi$. Note that the
previous equation is somewhat reminiscent of the ranking
model adaptation of [21].
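
The effect of the prior on the stochastic updates can be seen from a short derivation. It assumes a squared-norm regularizer $\frac{\mu}{2}\|\Phi - \Phi^{\mathcal{A}}\|^2$, a violating triple $(x, y, \bar{y})$ whose hinge term is weighted by some $\beta_k$, and a step size $\eta_t$; the symbols follow the notation of this section and the derivation is a sketch rather than a restatement of the authors' proof.

```latex
\begin{align*}
\frac{\partial}{\partial \varphi(y)} \, \frac{\mu}{2} \|\Phi - \Phi^{\mathcal{A}}\|^2
  &= \mu \left( \varphi(y) - \varphi^{\mathcal{A}}(y) \right), \\
\frac{\partial}{\partial \varphi(y)} \, \ell(x, y, \bar{y}) &= -W'\theta(x),
  \qquad
\frac{\partial}{\partial \varphi(\bar{y})} \, \ell(x, y, \bar{y}) = W'\theta(x), \\
\varphi^{(t)}(y)
  &= \varphi^{(t-1)}(y) - \eta_t \left[ \mu \left( \varphi^{(t-1)}(y) - \varphi^{\mathcal{A}}(y) \right)
     - \beta_k W'\theta(x) \right] \\
  &= (1 - \eta_t \mu)\, \varphi^{(t-1)}(y) + \eta_t \mu\, \varphi^{\mathcal{A}}(y)
     + \eta_t \beta_k W'\theta(x).
\end{align*}
```

Each step therefore pulls the embedding of the correct class toward its attribute prior while moving it along $W'\theta(x)$; the update for the violating class $\bar{y}$ is identical except that the sign of the last term is reversed.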

Training. For the optimization of the zero-shot as well as
the few-shots learning, we follow [67] and use Stochastic
Gradient Descent (SGD). Training with SGD consists at
each step $t$ in (i) choosing a sample $(x, y)$ at random, (ii)
repeatedly sampling a negative class denoted $\bar{y}$ with $\bar{y} \neq y$
until a violating class is found, i.e. until $\ell(x, y, \bar{y}) > 0$,
and (iii) updating the projection matrix (and the class
embeddings in case of few-shots learning) using a sample-wise
estimate of the regularized risk. Following [67], [44],
we use a constant step size $\eta_t = \eta$. The detailed algorithm
is provided in Algorithm 1.
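
The three training steps above can be sketched for the zero-shot case, where the label embeddings stay fixed and only the projection matrix is updated. The sketch assumes the 0/1 loss, $\alpha_k = 1/k$, and a rank estimate of $\lfloor (C-1)/k \rfloor$ after $k$ draws; the function names, the small random initialization, the synthetic data and the hyper-parameter values are all hypothetical.

```python
import numpy as np

def beta(k):
    """beta_k = sum_{j=1}^k alpha_j with alpha_j = 1/j."""
    return float(np.sum(1.0 / np.arange(1, k + 1)))

def sgd_step(W, theta_x, y, Phi, eta, rng):
    """One stochastic step with Phi fixed (zero-shot case); W is updated in place.

    Samples negative classes until one violates the margin, then updates W.
    Returns the violating class, or None if no violation was found.
    """
    C = Phi.shape[1]
    negatives = [c for c in range(C) if c != y]
    for k in range(1, C):
        y_bar = int(rng.choice(negatives))
        # Margin term with the 0/1 loss: Delta(y, y_bar) = 1.
        ell = 1.0 + theta_x @ W @ (Phi[:, y_bar] - Phi[:, y])
        if ell > 0:
            weight = beta((C - 1) // k)  # rank estimated from the number of draws
            W += eta * weight * np.outer(theta_x, Phi[:, y] - Phi[:, y_bar])
            return y_bar
    return None

def train(X, labels, Phi, eta=0.1, T=500, seed=0):
    """Run T stochastic steps; X is (N, D), labels is (N,), Phi is (E, C)."""
    rng = np.random.default_rng(seed)
    W = 1e-3 * rng.normal(size=(X.shape[1], Phi.shape[0]))
    for _ in range(T):
        n = int(rng.integers(len(X)))
        sgd_step(W, X[n], int(labels[n]), Phi, eta, rng)
    return W

# Synthetic data (hypothetical): 4 classes, 5 attributes, 8-dim image features.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(5, 4))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)
M = rng.normal(size=(8, 5))
labels = rng.integers(4, size=200)
X = (M @ Phi[:, labels]).T + 0.05 * rng.normal(size=(200, 8))
W = train(X, labels, Phi)
train_accuracy = float(np.mean(np.argmax(X @ W @ Phi, axis=1) == labels))
```

The sketch stops sampling at the first violating class, following the description in the text; early stopping on a validation set is omitted for brevity.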

4 Label embedding beyond attributes 

A wealth of label embedding methods have been proposed
over the years, in several communities and most often for
different purposes. Previous works considered either fixed
(data-independent) or learned-from-data embeddings. Data
used for learning could be either restricted to the task at
hand or could also be complemented by side information
from other modalities. The purpose of this paper is to
propose a general framework that encompasses all these
approaches, and compare the empirical performance on
image classification tasks. Label embedding methods could
be organized according to two criteria: i) task-focused or
using other sources of side information; ii) fixed or
data-dependent embedding.

Fig. 2. Illustration of Hierarchical Label Embedding
(HLE). In this example, given 7 classes (including
a “root” class), class 6 is encoded in a binary
7-dimensional space as $\varphi^{\mathcal{H}}(6) = [1,0,1,0,0,1,0]$.

4.1 Side information in label embedding 

A first criterion to discriminate among the different approaches
for label embedding is whether the method is
using only the training data for the task at hand, that is the
examples (images) along with their class labels, or if it is
using other sources of information. In the latter option, side
information impacts the outputs, and can rely on several
types of modalities. In our setting, these modalities could
be i) attributes, ii) class taxonomies or iii) textual corpora.
Attributes (i) were the focus of the previous section (see especially section 3.2).
In what follows, we focus on ii) and iii).

Class hierarchical structures explicitly use expert knowledge
to group the image classes into a hierarchy, such as
knowledge from ornithology for birds datasets. A hierarchical
structure on the classes requires an ordering operation
$\prec$ in $\mathcal{Y}$: $z \prec y$ means that $z$ is an ancestor of $y$ in the tree
hierarchy. Given this tree structure, we can define $\xi_{y,z} = 1$
if $z \prec y$ or $z = y$, and $\xi_{y,z} = 0$ otherwise. The hierarchy
embedding $\varphi^{\mathcal{H}}(y)$ can be defined as the $C$ dimensional vector:

$\varphi^{\mathcal{H}}(y) = [\xi_{y,1}, \ldots, \xi_{y,C}]$ (19)

Here, $\xi_{y,i}$ is the association measure of the $i$-th node in the
hierarchy with class $y$. See Figure 2 for an illustration. We
refer to this embedding as Hierarchy Label Embedding
(HLE). Note that HLE was first proposed in the context of 
structured learning [58]. Note also that, if classes are not 
organized in a tree structure but form a graph, other types 
of embeddings can be used, for instance by performing a 
kernel PCA on the commute time kernel [48]. 
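
The hierarchy embedding can be computed from a parent array, as in the sketch below. The tree is an assumption: it is a hypothetical 7-node hierarchy chosen so that class 6 receives the vector [1,0,1,0,0,1,0] quoted in the caption of Fig. 2, and the function name is illustrative.

```python
import numpy as np

def hierarchy_embedding(parent):
    """Build the C x C matrix whose column y is the hierarchy embedding of y.

    parent[y] is the index of the parent of node y, or -1 for the root.
    Entry (z, y) is 1 if z is y itself or an ancestor of y, and 0 otherwise.
    """
    C = len(parent)
    Phi_H = np.zeros((C, C))
    for y in range(C):
        z = y
        while z != -1:          # walk from y up to the root
            Phi_H[z, y] = 1.0
            z = parent[z]
    return Phi_H

# Hypothetical 7-node tree (0-based indices): node 0 is the root, nodes 1 and 2
# are its children, nodes 3 and 4 hang under node 1, nodes 5 and 6 under node 2.
parent = [-1, 0, 0, 1, 1, 2, 2]
Phi_H = hierarchy_embedding(parent)
embedding_of_class_6 = Phi_H[:, 5]  # class 6 in 1-based numbering
```

Classes that share ancestors share non-zero entries, so their dot-product similarity grows with the depth of their closest common ancestor.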

The co-occurrence of class names in textual corpora
can be automatically extracted using field guides or public
resources such as Wikipedia². Co-occurrences of class
names can be leveraged to infer relationships between
classes, leading to an embedding of the classes. Standard
approaches to produce word embeddings from co-occurrences
include Latent Semantic Analysis (LSA) [12],
probabilistic Latent Semantic Analysis (pLSA) [24] or
Latent Dirichlet Allocation (LDA) [6]. In this work, we use
the recent state-of-the-art approach of Mikolov et al. [40],
also referred to as “Word2Vec”. It uses a skip-gram model
that enforces a word (or a phrase) to be a good predictor
of its surrounding words, i.e. it enforces neighboring words
(or phrases) to be close to each other in the embedded
space. Such an embedding, which we refer to as Word2Vec
Label Embedding (WLE), was recently used for zero-shot
recognition [20] on fine-grained datasets [2].

2. http://en.wikipedia.org

In section 5, we compare attributes, class hierarchies 
and textual information (i.e. resp. ALE, HLE and WLE) 
as sources of side information for zero-shot recognition. 

4.2 Data-dependence of label embedding 

A second criterion is whether the label embedding used 
at prediction time was fit to training data at training 
time or not. Here, being data-dependent refers to the 
training data, putting aside all other possible sources of
information. There are several types of approaches in this 
respect: i) fixed and data-independent label embeddings; 
ii) data-dependent, learnt solely from training data; iii) 
data-dependent, learnt jointly from training data and side 
information. 

Fixed and data-independent embeddings correspond to fixed
mappings of the original class labels to a lower-dimensional
space. In our experiments, we explore three such
embeddings: i) trivial label embedding corresponding to the
identity mapping, which boils down to plain one-versus-rest
classification (OVR); ii) Gaussian Label Embedding
(GLE), using Gaussian random projection matrices and
assuming Johnson-Lindenstrauss properties; iii) Hadamard
Label Embedding, which similarly uses Hadamard matrices
to build the random projection matrices. None of these
three label embedding approaches uses the training data (or
any side information) to build the label embedding. It is
worthwhile to note that the underlying dimensions of these
label embeddings do rely on training data, since they are
usually cross-validated; we shall however ignore this fact
here for simplicity of exposition.

Data-dependent label embeddings use the training data to
build the label embedding used at prediction time. Popular
methods in this family are principal component analysis
on the outputs, canonical correlation analysis, and the
plain WSABIE approach.

Note that it is possible to use both the available training
data and side information to learn the embedding functions.
The proposed family of approaches, Attribute Label
Embedding (ALE), belongs to this latter category.

Combining embeddings. Different embeddings can be
easily combined in the label embedding framework, e.g.
through simple concatenation of the different embeddings
or through more complex operations such as a CCA of
the embeddings. This is to be contrasted with DAP, which
cannot accommodate other sources of prior information
so easily.


5 Experiments 

We now evaluate the proposed ALE framework on two 
public benchmarks: Animals With Attributes (AWA) and
CUB-200-2011 (CUB). AWA [30] contains roughly 30,000 
images of 50 animal classes. CUB [63] contains roughly 
11,800 images of 200 bird classes. 

We first describe in sections 5.1 and 5.2 respectively 
the input embeddings (i.e. image features) and output 
embeddings that we have used in our experiments. In 
section 5.3, we present zero-shot recognition experiments, 
where training and test classes are disjoint. In section 5.4, 
we go beyond zero-shot learning and consider the case 
where we have plenty of training data for some classes 
and little training data for others. Finally, in section 5.5 we
report results in the case where we have equal amounts of 
training data for all classes. 

5.1 Input embeddings 

Images are resized to 100K pixels if larger while keeping
the aspect ratio. We extract 128-dim SIFT descriptors [33]
and 96-dim color descriptors [11] from regular grids at
multiple scales. Both of them are reduced to 64-dim using
PCA. These descriptors are then aggregated into an
image-level representation using the Fisher Vector (FV) [45],
shown to be a state-of-the-art patch encoding technique
in [8]. Therefore, our input embedding function $\theta$ takes
as input an image and outputs a FV representation. Using
Gaussian Mixture Models with 16 or 256 Gaussians, we
compute one SIFT FV and one color FV per image and
concatenate them into either 4,096-dim (4K) or 65,536-dim
(64K) FVs. As opposed to [1], we do not apply
PQ-compression, which explains why we report better results
in the current work (e.g. on average 2% better with the
same output embeddings on CUB).
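The two FV sizes quoted above follow from simple arithmetic. The sketch below assumes the common FV formulation that keeps only the gradients with respect to the Gaussian means and variances (2 x D x K values per descriptor type); this assumption and the helper name `fisher_vector_dim` are illustrative and are not stated in the text.

```python
def fisher_vector_dim(num_gaussians: int,
                      descriptor_dim: int = 64,
                      num_descriptor_types: int = 2) -> int:
    """Dimensionality of the concatenated SIFT + color FVs.

    Assumption: each FV stacks mean and variance gradients (factor 2)
    for every Gaussian and every PCA-reduced descriptor dimension.
    """
    per_type = 2 * descriptor_dim * num_gaussians
    return num_descriptor_types * per_type


for k in (16, 256):
    print(k, fisher_vector_dim(k))
```

Under this assumption, 16 Gaussians give 4,096 dimensions and 256 Gaussians give 65,536, which matches the 4K and 64K settings.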

5.2 Output Embeddings 

In our experiments, we considered three embeddings
derived from side information: attributes, class taxonomies and
textual corpora. When considering attributes, we use the
attributes (binary or continuous) as they are provided with
the datasets, with no further side information.

Attribute Label Embedding (ALE). In AWA, each class
was annotated with 85 attributes by 10 students [42].
Continuous class-attribute associations were obtained by
averaging the per-student votes and subsequently thresholded
to obtain binary attributes. In CUB, 312 attributes were
obtained from a bird field guide. Each image was annotated
according to the presence/absence of these attributes. The
per-image attributes were averaged to obtain
continuous-valued class-attribute associations and thresholded with
respect to the overall mean to obtain binary attributes. By
default, we use continuous attribute embeddings in our
experiments on both datasets.
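The CUB-style procedure described above (average the per-image attributes within each class, then threshold) can be sketched as follows. This is a toy illustration: the class names and attribute values are made up, and we assume that "overall mean" denotes a single mean over all class-attribute entries, which the text does not spell out.

```python
def class_attribute_associations(image_attributes, image_labels):
    """Average binary per-image attribute vectors within each class."""
    grouped = {}
    for attrs, label in zip(image_attributes, image_labels):
        grouped.setdefault(label, []).append(attrs)
    return {
        label: [sum(col) / len(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def binarize(associations):
    """Threshold at the mean over all class-attribute entries (assumed)."""
    values = [v for vec in associations.values() for v in vec]
    overall_mean = sum(values) / len(values)
    return {
        label: [1 if v > overall_mean else 0 for v in vec]
        for label, vec in associations.items()
    }


# Toy data: 4 images, 3 hypothetical attributes, 2 hypothetical classes.
image_attributes = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
image_labels = ["gull", "gull", "wren", "wren"]

continuous = class_attribute_associations(image_attributes, image_labels)
binary = binarize(continuous)
print(continuous)
print(binary)
```

The continuous vectors keep the information that the third attribute is present in only half of the images of each class, which the binary codes discard.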

Hierarchical Label Embedding (HLE). We use the Wordnet
hierarchy as a source of prior information to compute
output embeddings. We collect the set of ancestors of the
50 AWA (resp. 200 CUB) classes from Wordnet and build
a hierarchy with 150 (resp. 299) nodes^3. Hence, the output
dimensionality is 150 (resp. 299) for AWA (resp. CUB).
We compute the binary output codes following [58]: for a
given class, an output dimension is set to 0 or 1 according
to the absence or presence of the corresponding node among
the ancestors. The class embeddings are subsequently
$\ell_2$-normalized.
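A minimal sketch of this construction on a toy tree is given below. The hierarchy, the node ordering and the helper names are hypothetical; following equation (19), we assume that the node of the class itself is also switched on.

```python
import math


def ancestors_and_self(node, parent):
    """Walk up the tree from node to the root (parent maps child -> parent)."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def hierarchy_embedding(cls, parent, nodes):
    """Binary code over all nodes, followed by l2-normalization."""
    active = set(ancestors_and_self(cls, parent))
    code = [1.0 if n in active else 0.0 for n in nodes]
    norm = math.sqrt(sum(v * v for v in code))
    return [v / norm for v in code]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy hierarchy with 6 nodes and 3 leaf classes.
parent = {"dog": "canine", "wolf": "canine", "canine": "mammal",
          "cat": "feline", "feline": "mammal"}
nodes = ["mammal", "canine", "feline", "dog", "wolf", "cat"]

dog, wolf, cat = (hierarchy_embedding(c, parent, nodes)
                  for c in ("dog", "wolf", "cat"))
print(dot(dog, wolf), dot(dog, cat))
```

Classes that share more ancestors obtain a larger inner product (here dog/wolf versus dog/cat), which is what lets the embedding transfer information between related classes.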

Word2Vec Label Embedding (WLE). We trained the
skip-gram model on the 13 February 2014 version of the
English-language Wikipedia, which was tokenized to 1.5
million words and phrases that contain the names of our
visual object classes. Additionally we use a hierarchical
softmax layer^4. The dimensionality of the output
embeddings was cross-validated on a per-dataset basis.

We also considered three data-independent embeddings: 

One-Vs-Rest embedding (OVR). The embedding
dimensionality is $C$, where $C$ is the number of classes, and the
embedding matrix $\Phi$ is the $C \times C$ identity matrix. This is
equivalent to training independently one classifier per class.

Gaussian Label Embedding (GLE). The class embeddings
are drawn from a standard normal distribution, similar
to random projections in compressed sensing [13].
Similarly to WSABIE, the label embedding dimensionality $E$ is
a parameter of GLE which needs to be cross-validated. For
GLE, since the embedding is randomly drawn, we repeat
the experiments 10 times and report the average (as well
as the standard deviation when relevant).
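As an illustration, the sketch below draws such random class embeddings with the Python standard library. The function name, the seeds and the sizes (50 classes, 85 dimensions) are illustrative choices, not the settings of our experiments.

```python
import random


def gaussian_label_embedding(num_classes, dim, seed):
    """Draw one dim-dimensional standard-normal vector per class."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_classes)]


# Ten independent draws, one per repetition of an experiment.
draws = [gaussian_label_embedding(num_classes=50, dim=85, seed=s)
         for s in range(10)]
print(len(draws), len(draws[0]), len(draws[0][0]))
```

Fixing the seed per repetition keeps each run reproducible while still allowing the mean and standard deviation over draws to be reported.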

Hadamard Label Embedding. A Hadamard matrix is
a square matrix whose rows/columns are mutually
orthogonal and whose entries are in {-1, 1} [13]. Hadamard
matrices can be computed iteratively with $H_1 = (1)$ and

$$H_{2^k} = \begin{pmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{pmatrix}.$$

In our experiments, Hadamard embedding yielded
significantly worse results than GLE. Therefore, we only
report GLE results in the following.
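The iterative construction of Hadamard matrices can be sketched in a few lines. The function name `hadamard` is ours; the code only illustrates the recursion and the orthogonality property, not how the rows were assigned to classes in our experiments.

```python
def hadamard(k):
    """Return the 2^k x 2^k Hadamard matrix built by the doubling recursion."""
    H = [[1]]
    for _ in range(k):
        top = [row + row for row in H]                   # [H,  H]
        bottom = [row + [-v for v in row] for row in H]  # [H, -H]
        H = top + bottom
    return H


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


H8 = hadamard(3)
for row in H8:
    print(row)
```

Every pair of distinct rows has a zero inner product and every row has squared norm equal to the matrix size, so all class codes are equidistant.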

Finally, when labeled training data is available in
sufficient quantity, the embeddings can be learned from the
training data. In this work, we considered one data-driven
approach to label embedding:

Web-Scale Annotation By Image Embedding (WSABIE).
The objective function of WSABIE [67] is provided
in (12) and the corresponding optimization algorithm is
similar to the one of ALE described in Algorithm 1.
The difference is that WSABIE does not use any prior
information and, therefore, the regularization value $\mu$ is set
to 0 in equations (17) and (18). Another difference with
ALE is that the embedding dimensionality is a parameter
of WSABIE which is obtained through cross-validation.
This is an advantage of WSABIE since it provides an
additional free parameter compared to ALE. However, the
cross-validation procedure is computationally intensive.

3. In some cases, some of the nodes have a single child. We did not
clean the automatically obtained hierarchy.

4. We obtain word2vec representations using the publicly available
implementation from https://code.google.com/p/word2vec/.

AWA
                         FV=4K                       FV=64K
$\mu$   $\ell_2$    cont   {0,1}   {-1,+1}     cont   {0,1}   {-1,+1}
no      no          41.5   34.2    32.5        44.9   42.4    41.8
yes     no          42.2   33.8    33.8        44.9   42.4    42.4
no      yes         45.7   34.2    34.8        48.5   44.6    41.8
yes     yes         44.2   34.9    34.9        47.7   44.8    44.8

CUB
                         FV=4K                       FV=64K
$\mu$   $\ell_2$    cont   {0,1}   {-1,+1}     cont   {0,1}   {-1,+1}
no      no          17.2   10.4    12.8        22.7   20.5    19.6
yes     no          16.4   10.4    10.4        21.8   20.5    20.5
no      yes         20.7   15.4    15.2        26.9   22.3    19.6
yes     yes         20.0   15.6    15.6        26.3   22.8    22.8

TABLE 1

Comparison of the continuous embedding (cont), the
binary {0,1} embedding and the binary {-1,+1}
embedding. We also study the impact of
mean-centering ($\mu$) and $\ell_2$-normalization.

In summary, in the following we report results for six 
label embedding strategies: ALE, HLE, WLE, OVR, GLE 
and WSABIE. Note that OVR, GLE and WSABIE are not 
applicable to zero-shot learning since they do not rely on 
any source of prior information and consequently do not 
provide a meaningful way to embed a new class for which 
we do not have any training data. 

5.3 Zero-Shot Learning 

Set-up. In this section, we evaluate the proposed ALE in the
zero-shot setting. For AWA, we use the standard zero-shot
setup which consists in learning parameters on 40 classes
and evaluating accuracy on the 10 remaining ones. We use
all the images in the 40 learning classes (≈ 24,700 images) to
learn and cross-validate the model parameters. We then use
all the images in the 10 evaluation classes (≈ 6,200 images)
to measure accuracy. For CUB, we use 150 classes for
learning (≈ 8,900 images) and 50 for evaluation (≈ 2,900
images).

Comparison of output encodings for ALE. We first
compare three different output encodings: (i) continuous
encoding, i.e. we do not binarize the class-attribute
associations, (ii) binary {0,1} encoding and (iii) binary
{-1,+1} encoding. We also compare two normalizations:
(i) mean-centering of the output embeddings and (ii)
$\ell_2$-normalization. We use the same embedding and
normalization strategies at training and test time.

        RR     SSVM   RNK
AWA     44.5   47.9   48.5
CUB     21.6   26.3   26.3

TABLE 2

Comparison of different learning algorithms for ALE:
ridge-regression (RR), multi-class SSVM (SSVM) and
ranking based on WSABIE (RNK).

        Obj. pred.       Att. pred.
        DAP    ALE       DAP    ALE
AWA     41.0   48.5      72.7   72.7
CUB     12.3   26.9      64.8   59.4

TABLE 3

Comparison of DAP [30] with ALE. Left: object
classification accuracy (top-1 %) on the 10 AWA and
50 CUB evaluation classes. Right: attribute prediction
accuracy (AUC %) on the 85 AWA and 312 CUB
attributes. We use 64K FVs.

Results are shown in Table 1. The conclusions are the
following. Significantly better results are obtained
with continuous embeddings than with thresholded binary
embeddings. On AWA with 64K-dim FV, the accuracy is
48.5% with continuous and 41.8% with {-1,+1}
embeddings. Similarly on CUB with 64K-dim FV, we obtain
26.9% with continuous and 19.6% with {-1,+1}
embeddings. This is expected since continuous embeddings
encode the strength of association between a class and an
attribute and, therefore, carry more information. We believe
that this is a major strength of the proposed approach as
other algorithms such as DAP cannot accommodate such
soft values in a straightforward manner. Mean-centering
seems to have little impact with 0.8% (between 48.5% and
47.7%) on AWA and 0.6% (between 26.9% and 26.3%)
on CUB using 64K FV as input and continuous attributes
as output embeddings. On the other hand, $\ell_2$-normalization
makes a significant difference in all configurations except
for the {-1,+1} encoding (e.g. only 2.4% difference
between 44.8% and 42.4% on AWA, 2.3% difference
between 22.8% and 20.5% on CUB). This is expected,
since all class embeddings already have a constant norm
for {-1,+1} embeddings (the square-root of the number
of output dimensions $E$). In what follows, we always use
the continuous $\ell_2$-normalized embeddings without
mean-centering.
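The encodings and normalizations compared above can be sketched as follows. The helper names and the toy vectors are illustrative, and we assume that mean-centering subtracts the per-dimension mean computed over classes. The example also checks the remark on {-1,+1} codes: their norm is always the square root of $E$.

```python
import math


def to_plus_minus_one(binary_vec):
    """Map a {0,1} code to a {-1,+1} code."""
    return [2 * b - 1 for b in binary_vec]


def mean_center(embeddings):
    """Subtract the per-dimension mean computed over classes (assumed)."""
    n = len(embeddings)
    means = [sum(col) / n for col in zip(*embeddings)]
    return [[v - m for v, m in zip(vec, means)] for vec in embeddings]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def l2_normalize(vec):
    norm = l2_norm(vec)
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Three hypothetical classes described by E = 4 binary attributes.
binary = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]
signed = [to_plus_minus_one(v) for v in binary]

print([l2_norm(v) for v in binary])  # norms differ across classes
print([l2_norm(v) for v in signed])  # every norm equals sqrt(E)
```

Since the signed codes already share the same norm, dividing by it only rescales all classes by a common factor, whereas for {0,1} and continuous codes it removes a class-dependent scale.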

Comparison of learning algorithms. We now compare 
three objective functions to learn the mapping between 
inputs and outputs. The first one is Ridge Regression (RR) 
which was used in [43] to map input features to output 
attribute labels. In a nutshell, RR consists in optimizing a 
regularized quadratic loss for which there exists a closed 
form formula. The second one is the standard structured 
SVM (SSVM) multiclass objective function of [58]. The 
third one is the ranking objective (RNK) of WSABIE [67] 
which is described in detail in section 3.3. The results are
provided in Table 2. On AWA, the highest result is 48.5%
obtained with RNK, followed by SSVM with 47.9% whereas
RR performs worse with 44.5%. On CUB, RNK and
SSVM obtain 26.3% accuracy whereas RR again performs
somewhat worse with 21.6%. Therefore, the conclusion 
is that the multiclass and ranking frameworks are on-par 
and outperform the simple ridge regression. This is not 
surprising since the two former objective functions are more 
closely related to our end goal which is classification. In 
what follows, we always use the ranking framework (RNK) 
to learn the parameters of our model, since it both performs 
well and was shown to be scalable [67], [44]. 


Comparison with DAP. In this section we compare our 
approach to direct attribute prediction (DAP) [30]. We start 
by giving a short description of DAP and, then, present the 
results of the comparison. 

In DAP, an image $x$ is assigned to the class $y$ which has
the highest posterior probability:

$$p(y|x) \propto \prod_{e=1}^{E} p(a_e = \rho_{y,e} | x). \qquad (20)$$

$\rho_{y,e}$ is the binary association measure between attribute $a_e$
and class $y$. $p(a_e = 1|x)$ is the probability that image $x$
contains attribute $e$. We train for each attribute one linear
classifier on the FVs. We use a (regularized) logistic loss
which provides an attribute classification accuracy similar
to SVM but with the added benefit that its output is already
a probability.
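The decision rule of equation (20) can be illustrated with a short sketch. The attribute probabilities and class signatures below are made-up toy values, and the function name is ours; the product is accumulated in the log domain, which leaves the arg max unchanged but requires probabilities strictly between 0 and 1.

```python
import math


def dap_log_scores(attr_probs, class_signatures):
    """Log of the product in (20) for every class.

    attr_probs[e]          = p(a_e = 1 | x), assumed to lie in (0, 1)
    class_signatures[y][e] = rho_{y,e} in {0, 1}
    """
    scores = {}
    for cls, signature in class_signatures.items():
        log_p = 0.0
        for p_one, rho in zip(attr_probs, signature):
            # p(a_e = rho | x): p_one if the class has the attribute,
            # 1 - p_one otherwise.
            log_p += math.log(p_one if rho == 1 else 1.0 - p_one)
        scores[cls] = log_p
    return scores


# Toy example with 3 hypothetical attributes and 3 hypothetical classes.
attr_probs = [0.9, 0.2, 0.7]
class_signatures = {"zebra": [1, 0, 1], "whale": [0, 0, 1], "bat": [1, 1, 0]}

scores = dap_log_scores(attr_probs, class_signatures)
predicted = max(scores, key=scores.get)
print(predicted)
```

Note that the rule only accepts binary signatures: a continuous class-attribute association has no direct counterpart in the product.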

Table 3 (left) compares the proposed ALE to DAP for 
64K-dim FVs. Our implementation of DAP obtains 41.0% 
accuracy on AWA and 12.3% on CUB. Our result for DAP 
on AWA is comparable to the 40.5% accuracy reported by 
Lampert. Note however that the features are different.
Lampert uses bag-of-features and a non-linear kernel classifier
($\chi^2$ SVMs), whereas we use Fisher vectors and a linear
SVM. Linear SVMs enable us to run experiments more
efficiently. We observe that on both datasets, the proposed
ALE outperforms DAP significantly: 48.5% vs. 41.0% top-1
accuracy on AWA and 26.9% vs. 12.3% on CUB.

Attribute Correlation. While correlation in the input space 
is a well-studied topic, comparatively little work has been 
done to measure the correlation in the output space. Here, 
we reduce the output space dimensionality and study the 
impact on the classification accuracy. It is worth noting 
that reducing the output dimensionality leads to significant 
speed-ups at training and test times. We explore two 
different techniques: Singular Value Decomposition (SVD) 
and attribute sampling. We learn the SVD on AWA (resp. 
CUB) on the 50x85 (resp. 200x312) matrix. For the 
sampling, we sub-sample a fixed number of attributes and 
repeat the experiments 10 times for 10 different random 
sub-samplings. The results of these experiments are pre¬ 
sented in Figure 3. 

We can conclude that there is a significant amount 
of correlation between attributes. For instance, on AWA 
with 4K-dim FVs (Figure 3(a)) when reducing the output 
dimensionality to 25, we lose less than 2% accuracy and 
with a reduced dimensionality of 50, we perform even 
slightly better than using all the attributes. On the same 
dataset with 64K-dim FVs (Figure 3(c)) the accuracy drops 
from 48.5% to approximately 45% when reducing from 
an 85-dim space to a 25-dim space. More impressively, 
on CUB with 4K-dim FVs (Figure 3(b)) with a reduced 
dimensionality to 25, 50 or 100 from 312, the accuracy 
is better than the configuration that uses all the attributes. 
On the same dataset with 64K-dim FVs (Figure 3(d)), 
with 25 dimensions the accuracy is on par with the 312-dim
embedding. SVD outperforms a random sampling of
the attribute dimensions, although there is no guarantee
that SVD will select the most informative dimensions (see
for instance the small dip in performance on CUB at 50
dimensions). In random sampling of output embeddings,
the choice of the attributes seems to be an important factor
that affects the descriptive power of output embeddings.
Consequently, the variance is higher (e.g. see Figures 3(a)
and 3(c) with a reduced attribute dimensionality of
5 or 10) when a small number of attributes is selected. In
the following experiments, we do not use dimensionality
reduction of the attribute embeddings.

Fig. 3. Classification accuracy on AWA and CUB as a function of the label embedding dimensionality. We
compare the baseline which uses all attributes, with an SVD dimensionality reduction and a sampling of attributes
(we report the mean and standard deviation over 10 samplings).

        ALE    HLE    WLE    AHLE early   AHLE late
AWA     48.5   40.4   32.5   46.8         49.4
CUB     26.9   18.5   16.8   27.1         27.3

TABLE 4

Comparison of attributes (ALE), hierarchies (HLE) and
Word2Vec (WLE) for label embedding. We consider
the combination of ALE and HLE by simple
concatenation (AHLE early) or by the averaging of the
scores (AHLE late). We use 64K FVs.

Attribute interpretability. In ALE, each column of $W$
can be interpreted as an attribute classifier and $\theta(x)^\top W$
as a vector of attribute scores of $x$. However, one major
difference with DAP is that we do not optimize for attribute
classification accuracy. This might be viewed as a
disadvantage of our approach as we might lose interpretability,
an important property of attribute-based systems when, for
instance, one wants to include a human in the loop [7], [63].
We, therefore, measured the attribute prediction accuracy
of DAP and ALE. For each attribute, following [30], we
measure the AUC on the set of the evaluation classes and
report the mean.
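The evaluation measure can be sketched as follows: the AUC of one attribute is the probability that a randomly chosen positive image is scored above a randomly chosen negative one, and the reported number is the mean over attributes. The scores and labels below are toy values and the function names are ours; ground-truth per-image attribute labels are assumed to be given.

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked
    correctly, with ties counted as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_attribute_auc(score_matrix, label_matrix):
    """Rows are images, columns are attributes."""
    per_attribute = [auc(list(s), list(l))
                     for s, l in zip(zip(*score_matrix), zip(*label_matrix))]
    return sum(per_attribute) / len(per_attribute)


# Toy example: 4 images, 2 hypothetical attributes.
score_matrix = [[0.9, 0.2], [0.8, 0.6], [0.3, 0.4], [0.1, 0.7]]
label_matrix = [[1, 1], [1, 0], [0, 0], [0, 1]]
print(mean_attribute_auc(score_matrix, label_matrix))
```

Because the AUC only depends on the ranking of the scores, it can be computed for the uncalibrated scores $\theta(x)^\top W$ of ALE as well as for the probabilities output by DAP.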

Attribute prediction scores are shown in Table 3 (right).
On AWA, the DAP and ALE methods obtain the same AUC
of 72.7%. On the other hand, on CUB the DAP
method obtains 64.8% AUC whereas ALE is 5.4% lower
with 59.4% AUC. In summary, the attribute prediction
accuracy of DAP is at least as high as that of ALE.
This is expected since DAP directly optimizes
attribute-classification accuracy. However, the AUC for ALE is
still reasonable, especially on AWA (performance is on
par). Thus, our learned attribute classifiers should still be
interpretable. We provide qualitative results on AWA in
Figure 4: we show the four highest ranked images for
some of the attributes with the highest AUC scores (namely
>90%) and lowest AUC scores (namely <50%).

Comparison of ALE, HLE and WLE. We now compare
different sources of side information. Results are provided
in Table 4. On AWA, ALE obtains 48.5% accuracy, HLE
obtains 40.4% and WLE obtains 32.5% accuracy. On CUB,
ALE obtains 26.9% accuracy, HLE obtains 18.5% and
WLE obtains 16.8% accuracy. Note that in [1], we reported
better results on AWA with HLE compared to ALE. The
main difference with the current experiment is that we
use continuous attribute encodings while [1] was using a
binary encoding. Note also that the comparatively poor
performance of WLE with respect to ALE and HLE is
not unexpected: while ALE and HLE rely on strong expert
supervision, WLE is computed in an unsupervised manner
from Wikipedia.

We also consider the combination of attributes and
hierarchies (we do not consider the combination of WLE with
other embeddings given its relatively poor performance).
We explore two simple alternatives: the concatenation of
the embeddings (AHLE early) and the late fusion of
classification scores calculated by averaging the scores obtained
using ALE and HLE separately (AHLE late). On both
datasets, late fusion has a slight edge over early fusion and
leads to a small improvement over ALE alone (+0.9% on
AWA and +0.4% on CUB).
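The two fusion schemes differ in where the combination happens, as the following sketch shows. All class names, embeddings and scores are toy values, and the function names are ours; in early fusion a single model is trained on the concatenated embeddings, whereas in late fusion two separately trained models are combined at the score level.

```python
def early_fusion(phi_attr, phi_hier):
    """Concatenate the attribute and hierarchy embeddings of each class."""
    return {y: list(phi_attr[y]) + list(phi_hier[y]) for y in phi_attr}


def late_fusion(scores_attr, scores_hier):
    """Average the per-class scores of two separately trained models."""
    return {y: 0.5 * (scores_attr[y] + scores_hier[y]) for y in scores_attr}


def predict(scores):
    """Return the class with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical embeddings of two classes.
phi_attr = {"fox": [0.6, 0.8], "wolf": [1.0, 0.0]}
phi_hier = {"fox": [0.0, 1.0, 0.0], "wolf": [0.0, 0.0, 1.0]}
print(early_fusion(phi_attr, phi_hier))

# Hypothetical compatibility scores of one test image under each model.
scores_attr = {"fox": 0.2, "wolf": 0.5}
scores_hier = {"fox": 0.9, "wolf": 0.4}
print(predict(scores_attr), predict(late_fusion(scores_attr, scores_hier)))
```

In this toy case the attribute model alone prefers one class while the averaged scores prefer the other, which illustrates how a second source of side information can correct a decision.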

In what follows, we do not report further results with 
WLE given its relatively poor performance and focus on 
ALE and HLE. 

Comparison with the state-of-the-art. We can compare 
our results to those published in the literature on AWA 
since we are using the standard training/testing protocol 
(there is no such zero-shot protocol on CUB). To the best 
of our knowledge, the best zero-shot recognition results on 
AWA are those of Yu et al. [70] with 48.3% accuracy. We 
report 48.5% with ALE and 49.4% with AHLE (late fusion 
of ALE and HLE). Note that we use different features. 
Fig. 4. Sample attributes recognized with high (i.e. >90%) accuracy (top) and low (i.e. <50%) accuracy (bottom)
by ALE on AWA. For each attribute we show the images ranked highest. Note that an AUC < 50% means that
the prediction is worse than random on average. The images whose attribute is predicted correctly are circled in
green and those whose attribute is predicted incorrectly are circled in red.



Fig. 5. Classification accuracy on AWA and CUB
as a function of the number of training samples per
class. To train the classifiers, we use all the images
of the training “background” classes (used in zero-shot
learning), and a small number of images randomly
drawn from the relevant evaluation classes. Reported
results are 10-way in AWA and 50-way in CUB.
Panels: (a) AWA (FV=64K); (b) CUB (FV=64K).

5.4 Few-Shots Learning 

Set-up. In these experiments, we assume that we have few
(e.g. 2, 5, 10, etc.) training samples for a set of classes
of interest (the 10 AWA and 50 CUB evaluation classes)
in addition to all the samples from a set of “background
classes” (the remaining 40 AWA and 150 CUB classes).
For each evaluation class, we use approximately half of
the images for training (the 2, 5, 10, etc. training samples
are drawn from this pool) and the other half for testing. The
minimum number of images per class in the evaluation set
is 302 (AWA) and 42 (CUB). To have the same number of
training samples, we use 100 images (AWA) and 20 images
(CUB) per class as training set and the remaining images
for testing.

Algorithms. We compare the proposed ALE with three
baselines: OVR, GLE and WSABIE. We are especially
interested in analyzing the following factors: (i) the
influence of parameter sharing (ALE, GLE, WSABIE) vs. no
parameter sharing (OVR), (ii) the influence of learning the
embedding (WSABIE) vs. having a fixed embedding (ALE,
OVR and GLE), and (iii) the influence of prior information
(ALE) vs. no prior information (OVR, GLE and WSABIE).

For ALE and WSABIE, $W$ is initialized to the matrix
learned in the zero-shot experiments. For ALE, we
experimented with three different learning variations:

• ALE($W$) consists in learning the parameters $W$ and
keeping the embedding fixed ($\Phi = \Phi^{\mathcal{A}}$).

• ALE($\Phi$) consists in learning the embedding parameters
$\Phi$ and keeping $W$ fixed.

• ALE($W\Phi$) consists in learning both $W$ and $\Phi$.

While both ALE($W$) and ALE($\Phi$) are implemented
by stochastic (sub)gradient descent (see Algorithm 1 in
Sec. 3.3), ALE($W\Phi$) is implemented by stochastic
alternating optimization. Stochastic alternating optimization
alternates between SGD for optimizing over the variable
$W$ and SGD for optimizing over the variable $\Phi$. Theoretical
convergence of SGD for ALE($W$) and ALE($\Phi$) follows from
standard results in stochastic optimization with convex
non-smooth objectives [53], [52]. Theoretical convergence of
the stochastic alternating optimization is beyond the scope
of the paper. Experimental results show that the strategy
works well in practice.

Results. We show results in Figure 5 for AWA and CUB
using 64K-dim features. We can draw the following
conclusions. First, GLE underperforms all other approaches for
limited training data, which shows that random embeddings
are not appropriate in this setting. Second, in general,
WSABIE and ALE outperform OVR and GLE for small
training sets (e.g. for fewer than 10 training samples), which
shows that learned embeddings (WSABIE) or embeddings
based on prior information (ALE) can be effective when
training data is scarce. Third, for tiny amounts of training
data (e.g. 2-5 training samples per class), ALE outperforms
WSABIE, which shows the importance of prior information
in this setting. Fourth, all variations of ALE, namely ALE($W$),
ALE($\Phi$) and ALE($W\Phi$), perform somewhat similarly.
Fifth, as the number of training samples increases, all
algorithms seem to converge to a similar accuracy, i.e. as
expected parameter sharing and prior information are less
crucial when training data is plentiful.

5.5 Learning and testing on the full datasets

In these experiments, we learn and test the classifiers on 
the 50 AWA (resp. 200 CUB) classes. For each class,
we reserve approximately half of the data for training 
and cross-validation purposes and half of the data for 
test purposes. On CUB, we use the standard training/test 
partition provided with the dataset. Since the experimental 
protocol in this section is significantly different from the 
one chosen for zero-shot and few-shots learning, the results 
cannot be directly compared with those of the previous 
sections. 

Comparison of output encodings. We first compare
different encoding techniques (continuous embedding vs.
binary embedding) and normalization strategies (with/without
mean-centering and with/without $\ell_2$-normalization). The
results are provided in Table 5. We can draw the following
conclusions.

As is the case for zero-shot learning, mean-centering
has little impact and $\ell_2$-normalization consistently improves
performance, showing the importance of normalized
outputs. On the other hand, a major difference with the
zero-shot case is that the {0,1} and continuous embeddings
perform on par. On AWA, in the 64K-dim FVs case, ALE with
continuous embeddings leads to 53.3% accuracy whereas
{0,1} embeddings lead to 52.5% (0.8% difference). On
CUB with 64K-dim FVs, ALE with continuous embeddings
leads to 21.6% accuracy while {0,1} embeddings lead
to 21.4% (0.2% difference). This seems to indicate that
the quality of the prior information used to perform label
embedding has less impact when training data is plentiful.

Comparison of output embedding methods. We now
compare on the full training sets several learning
algorithms: OVR, GLE with a costly setting of E = 2,500 output
dimensions (this was the largest output dimensionality
allowing us to run the experiments in a reasonable amount
of time), WSABIE (with cross-validated E), ALE (we use
the ALE($W$) variant where the embedding parameters are
kept fixed), HLE and AHLE (with early and late fusion).
Results are provided in Table 6.

We can observe that, in this setting, all methods perform
somewhat similarly. In particular, the simple OVR and GLE
baselines provide a competitive performance: OVR
outperforms all other methods on CUB and GLE performs best
on AWA. This confirms that the quality of the embedding
has little importance when training data is plentiful.




AWA
                        FV=4K            FV=64K
$\mu$   $\ell_2$    {0,1}   cont     {0,1}   cont
no      no          42.3    41.6     45.3    46.2
no      yes         44.3    44.6     52.5    53.3
yes     no          42.2    41.6     45.8    46.2
yes     yes         44.8    44.5     51.3    52.0

CUB
                        FV=4K            FV=64K
$\mu$   $\ell_2$    {0,1}   cont     {0,1}   cont
no      no          13.0    13.9     16.5    16.7
no      yes         16.2    17.5     21.4    21.6
yes     no          13.2    13.9     16.5    16.7
yes     yes         16.1    17.3     17.3    21.6

TABLE 5

Comparison of different output encodings: binary
{0,1} encoding, continuous encoding, with/without
mean-centering ($\mu$) and with/without $\ell_2$-normalization.



        OVR    GLE    WSABIE   ALE    HLE    AHLE early   AHLE late
AWA     52.3   56.1   51.6     52.5   55.9   55.3         55.8
CUB     26.6   22.5   19.5     21.6   22.5   24.6         25.5

TABLE 6

Comparison of different output embedding methods
(OVR, GLE, WSABIE, ALE, HLE, AHLE early and
AHLE late) on the full AWA and CUB datasets (resp.
50 and 200 classes). We use 64K FVs.

Reducing the training set size. We also studied the effect
of reducing the amount of training data by using only 1/4,
1/2 and 3/4 of the full training set. We therefore sampled
the corresponding fraction of images from the full training
set and repeated the experiments ten times with ten different
samples. For these experiments, we report GLE results with
two settings: a low-cost setting, i.e. using the same
number of output dimensions E as ALE (i.e. 85 for AWA
and 312 for CUB), and a high-cost setting, i.e. using
a large number of output dimensions (E = 2,500; see the
comment above about the choice of this value). We
show results in Figure 6.

On AWA, GLE outperforms all alternatives, closely 
followed by AHLE late. On CUB, OVR outperforms all 
alternatives, closely followed again by AHLE late. ALE, 
HLE and GLE with high-dimensional embeddings perform 
similarly. For these experiments, a general conclusion is
that, when we use high dimensional features, even simple 
algorithms such as the OVR which are not well-justified 
for multi-class classification problems can lead to state-of- 
the-art performance. 

6 Conclusion 

We proposed to cast the problem of attribute-based
classification as one of label-embedding. The proposed Attribute
Label Embedding (ALE) addresses in a principled fashion
the limitations of the original DAP model. First, we solve
directly the problem at hand (image classification) without
introducing an intermediate problem (attribute
classification). Second, our model can leverage labeled training
data (if available) to update the label embedding, using
the attribute embedding as a prior. Third, the label
embedding framework is not restricted to attributes and can
accommodate other sources of side information such as
class hierarchies or word embeddings derived from textual
corpora.

Fig. 6. Learning on AWA and CUB using 1/4, 1/2,
3/4 and all the training data. Compared output
embeddings: OVR, GLE, WSABIE, ALE, HLE, AHLE early
and AHLE late. Experiments repeated 10 times for
different sampling of Gaussians. We use 64K FVs.
Panels: (a) AWA (FV=64K); (b) CUB (FV=64K).

In the zero-shot setting, we improved image classification
results with respect to DAP without losing attribute
interpretability. Continuous attributes can be effortlessly used
in ALE, leading to a large boost in zero-shot classification
accuracy. In addition, we have shown that the
dimensionality of the output space can be significantly reduced
with a small loss of accuracy. In the few-shots setting,
we showed improvements with respect to the WSABIE
algorithm, which learns the label embedding from labeled
data but does not leverage prior information.

Another important contribution of this work was to relate 
different approaches to label embedding: data-independent 
approaches (e.g. OVR, GLE), data-driven approaches (e.g. 
WSABIE) and approaches based on side information (e.g. 
ALE, HLE and WLE). We present here a unified framework
that allows them to be compared in a systematic manner.

Learning to combine several inputs has been extensively 
studied in machine learning and computer vision, whereas 
learning to combine outputs is still largely unexplored. We 
believe that it is a worthwhile research path to pursue. 